Web Scraping of online newspapers via Image Matching
Authors
Abstract
Reading is an activity that takes place widely on the web: every newspaper in the world has its own digital version on the internet, and many magazines exist only on the web. In such a scenario, Computer Vision offers a useful set of tools that can help web editors improve the quality of the service they provide. One of these tools is presented here: given a webpage of a newspaper or journal, the proposed framework localizes news items remotely clicked by users, returning the bounding box of an article's content within its homepage. The tool can therefore track an article in the page that contains it at any time during the day; this information helps web editors understand the trend of the published items and rearrange the contents of the homepage accordingly. The system has been developed with a hybrid approach. First, we manipulate the HTML source of the homepage to generate a visual template for the news item we want to localize. A visual template for a news item is an HTML document containing only the content of the article displayed in the homepage, that is, the text of the title and the image attached to the article, if any. The algorithm that builds the template needs only the URL of the article. We then take a screenshot of the template to obtain an image representing the news item. From this image we extract FAST keypoints [1], which suit the kind of images we are dealing with, namely almost blank images composed mainly of text. Keypoints are also extracted from the screenshot of the homepage at system start time, so we can match the article image against the homepage image and obtain the bounding box of the news item in its parent webpage. We provide four keypoint descriptors for the matching: SIFT [2], SURF [3], BRIEF [4], and BRISK [5].
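The template-building step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the article is linked from the homepage through `<a>` elements whose `href` equals the article URL exactly, and the names `ArticleLinkCollector` and `build_visual_template` are hypothetical. Only the Python standard library is used.

```python
from html import escape
from html.parser import HTMLParser


class ArticleLinkCollector(HTMLParser):
    """Collect visible text and image sources found inside <a> elements
    whose href equals the article URL (an assumed linking convention)."""

    def __init__(self, article_url):
        super().__init__()
        self.article_url = article_url
        self.inside = False   # True while parsing a matching <a> element
        self.texts = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href") == self.article_url:
            self.inside = True
        elif tag == "img" and self.inside and attrs.get("src"):
            self.images.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag == "a":
            self.inside = False

    def handle_data(self, data):
        if self.inside and data.strip():
            self.texts.append(data.strip())


def build_visual_template(homepage_html, article_url):
    """Return a small HTML document holding only the title text and the
    images that the homepage displays for the given article URL."""
    collector = ArticleLinkCollector(article_url)
    collector.feed(homepage_html)
    collector.close()
    title = " ".join(collector.texts)
    parts = [f"<h1>{escape(title)}</h1>"]
    parts += [f'<img src="{escape(src, quote=True)}">' for src in collector.images]
    return "<html><body>" + "".join(parts) + "</body></html>"
```

In a full pipeline the returned document would be rendered and captured as a screenshot; that rendering step depends on a browser and is left out here.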
We also implemented article localization via Template Matching, using the Normalized Correlation Coefficient as the measure of similarity. To make the matching reliable, we have to discard the redundant blank content along the borders of the image, which is present because the picture comes from a screenshot. We use the extracted keypoints for this purpose: since FAST keypoints are located only on non-blank parts of an image, we can use the bounding box containing them to crop the picture and remove the empty borders. We tested the system by labelling the bounding boxes of all articles for a set of websites, and then compared the extracted rectangles with those of the ground truth. We compared the performance of Template Matching against Keypoint Matching with the aforementioned descriptors. For each extracted bounding box, we measured the Euclidean distance between its center and the center of the corresponding ground-truth rectangle; for each test, we provide evaluation measures to assess the accuracy of the developed algorithm, and we also report HIT/MISS results. The results depend on the layout of the websites and on the way they arrange the contents in the homepage. Our tests show that good performance can be reached, and they also offer an interesting comparison among different keypoint descriptors.
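As an illustration of the template-matching and evaluation steps, the sketch below slides a template over a grayscale image, scores each window with the normalized correlation coefficient, and then measures the Euclidean distance between the centers of the predicted and ground-truth boxes. It is a plain-Python toy under stated assumptions, not the authors' code: images are nested lists of intensities, boxes are `(x, y, width, height)` tuples, all function names are hypothetical, and the HIT criterion (predicted center inside the ground-truth box) is an assumption, since the abstract does not define it. A real system would rely on an optimized image-processing library.

```python
import math


def ncc(patch, template):
    """Normalized correlation coefficient of two equal-size grayscale grids."""
    p = [v for row in patch for v in row]
    t = [v for row in template for v in row]
    mean_p, mean_t = sum(p) / len(p), sum(t) / len(t)
    num = sum((a - mean_p) * (b - mean_t) for a, b in zip(p, t))
    den = math.sqrt(sum((a - mean_p) ** 2 for a in p)
                    * sum((b - mean_t) ** 2 for b in t))
    return num / den if den else 0.0   # flat (blank) windows score 0


def match_template(image, template):
    """Return ((x, y, w, h), score) for the best-scoring window."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    best_score, best_xy = -2.0, (0, 0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            patch = [row[x:x + tw] for row in image[y:y + th]]
            score = ncc(patch, template)
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return (best_xy[0], best_xy[1], tw, th), best_score


def center(box):
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


def center_distance(predicted, truth):
    """Euclidean distance between the centers of two boxes."""
    (px, py), (tx, ty) = center(predicted), center(truth)
    return math.hypot(px - tx, py - ty)


def is_hit(predicted, truth):
    """Assumed HIT rule: the predicted center lies inside the truth box."""
    cx, cy = center(predicted)
    x, y, w, h = truth
    return x <= cx <= x + w and y <= cy <= y + h
```

Because blank windows have zero variance, they receive a score of 0 here, which is one reason cropping the empty borders of the template matters before matching.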
Similar resources
Image flip CAPTCHA
The massive and automated access to Web resources through robots has made it essential for Web service providers to make some conclusion about whether the "user" is a human or a robot. A Human Interaction Proof (HIP) like Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) offers a way to make such a distinction. CAPTCHA is a reverse Turing test used by Web serv...
A procedure for Web Service Selection Using WS-Policy Semantic Matching
In general, Policy-based approaches play an important role in the management of web services, for instance, in the choice of semantic web service and quality of services (QoS) in particular. The present research work illustrates a procedure for the web service selection among functionality similar web services based on WS-Policy semantic matching. In this study, the procedure of WS-Policy publi...
An Improved Semantic Schema Matching Approach
Schema matching is a critical step in many applications, such as data warehouse loading, Online Analytical Process (OLAP), Data mining, semantic web [2] and schema integration. This task is defined for finding the semantic correspondences between elements of two schemas. Recently, schema matching has found considerable interest in both research and practice. In this paper, we present a new impr...
Online Branding in Newspapers: A Conceptual Model
Online media branding strategies is growing at a very fast pace and is increasingly adopted not only by pure players companies but also by traditional firms. In the digital newspapers sector, online news are becoming the most required services from Internet users and their sites are among the most visited on the web. Researches on online branding of newspapers seem incomplete and there is still...
Automatic Detection and Banning of Content Stealing Bots for E-commerce
Content stealing in the web is becoming a serious concern for information and e-commerce websites. In the practices known as web fetching or web scraping [1], a stealer bot simulates a human web user to extract desired content off the victim’s website. Stolen content is then normally stripped of copyright or authorship information and rendered as belonging to the stealer, on a different site. T...
Journal title:
Volume, issue:
Pages: -
Publication date: 2014